Image Similarity Search


Image Similarity - Yet Another Search And Sort

This is an experiment that came about from studying how to identify duplicate khipus. One of the insights in that study was that visual similarity worked. Somewhat. So then the question suggests itself: can you sort and search on image similarity? Answer! Of course you can!

There are a number of approaches to solve that problem. Since this is a prototype experiment, I picked an off the shelf package, ran it, and then reverse engineered the output, as shown below. The input sources to the package were the small image-quilt thumbnails used in the Khipu Field Guide. The package I used, pixplot, does a k-means clustering using the pixel “features” in the image, then projects the image onto the 2d plane, using tSNE as the dimensionality reduction algorithm. The final result, known as a raster-fairy is a jittered result that fits into a grid like format.

At the bottom of the page I show the results, and how effective that experiment was.

Reverse Engineering the Output of PixPlot

Code
import json
# Initialize plotly
plotly.offline.init_notebook_mode(connected = False)


pixplot_directory = f"{ku.project_directory()}/data/pixplot/output"
imagelist_file = f"{pixplot_directory}/data/imagelists/imagelist.json"
with open(imagelist_file) as f:
  image_data_dict = json.load(f)
 
# parse and load the data
images = image_data_dict['images']
image_names = [x.replace("_wide.jpg","") for x in images]
image_sizes = image_data_dict['cell_sizes'][0]
image_atlas = image_data_dict['atlas']
image_positions = image_atlas['positions']
Code
atlas_points = [(item[0], item[1]) for item in image_positions[0]]
named_atlas_points = [(name, position[0], position[1]) for (name, position) in  zip(image_names, atlas_points)]
named_atlas_points_df = pd.DataFrame(named_atlas_points, columns=['name', 'x', 'y'])

fig = (px.scatter(named_atlas_points_df, x="x", y="y", 
                 hover_name='name', hover_data=['x', 'y'], 
                 title=f"<b>Image Similarity with Unknown Layout</b>",
                  width=1000, height=1000)
        .update_layout(showlegend=False).update(layout_coloraxis_showscale=False).show()
      )

Not what I want.

Code
grid_file = f"{pixplot_directory}/data/layouts/grid_layout.json"
with open(grid_file) as f:
  grid_data_dict = json.load(f)

grid_points = [(item[0], item[1]) for item in grid_data_dict]
named_grid_points = [(name, position[0], position[1]) for (name, position) in  zip(image_names, grid_points)]
named_grid_points_df = pd.DataFrame(named_grid_points, columns=['name', 'x', 'y'])

fig = (px.scatter(named_grid_points_df, x="x", y="y", 
                   hover_name='name', hover_data=['x', 'y'], 
                   title=f"<b>Image Similarity with Unknown Layout</b>",
                  width=1000, height=1000)
        .update_layout(showlegend=False).update(layout_coloraxis_showscale=False).show()
      )

Also, not what I want.

Code
rasterfairy_file = f"{pixplot_directory}/data/layouts/rasterfairy_layout.json"
with open(rasterfairy_file) as f:
  rasterfairy_data_dict = json.load(f)

grid_x_space = round(1./(1-.9166666))
grid_y_space = round(1./(1-.92308))
rasterfairy_points = [(round(item[0]*grid_x_space)+12, round(item[1]*grid_y_space)+13) for item in rasterfairy_data_dict]
named_rasterfairy_points = [(name, position[0], position[1]) for (name, position) in  zip(image_names, rasterfairy_points)]
named_rasterfairy_points_df = pd.DataFrame(named_rasterfairy_points, columns=['name', 'x', 'y'])

fig = (px.scatter(named_rasterfairy_points_df, x="x", y="y", 
                   hover_name='name', hover_data=['x', 'y'], 
                   title=f"<b>Image Similarity in Projected 2D Space (using UMAP/tSNE) Followed by a Gridded Deformation</b>",
                  width=1000, height=1000)
        .update_layout(showlegend=False).update(layout_coloraxis_showscale=False).show()
      )
named_rasterfairy_points_df
name x y
0 AS001 19 14
1 AS002 19 1
2 AS003 23 11
3 AS004 18 0
4 AS005 13 17
... ... ... ...
654 UR290 24 20
655 UR291A 0 7
656 UR292A 10 22
657 UR293 23 7
658 UR294 22 11

This provides an equivalent of what I need.

Building an HTML Table version of PixPlot’s RasterFairy

Code
max_x = max(named_rasterfairy_points_df.x.tolist())
max_y = max(named_rasterfairy_points_df.y.tolist())
table_dict = {(x,y):"" for x in range(max_x+1) for y in range(max_y+1)}
for row_num in range(len(named_rasterfairy_points_df)):
    record = named_rasterfairy_points_df.iloc[row_num]
    table_dict[(record['x'], record['y'])] = record['name']
#table_dict
Code
def names_in_row(y_index): return [item[1] for item in table_dict.items() if item[0][1] == y_index]
def thumbnail_image_size(aName): return image_sizes[image_names.index(aName)]

def rasterfairy_table_repr():
    table_rep = "<table style=\"position:relative; left:40px; top:200px;padding-bottom:30px;\" >"
    for row_num in range(max_y,-1,-1):
        row_names = names_in_row(row_num)
        table_rep += rasterfairy_table_row_repr(row_names)
    table_rep += "</table>"
    return table_rep

def rasterfairy_table_row_repr(khipu_names):
    row_rep = "<tr>"
    for name in khipu_names:
        row_rep += f"<td>{rasterfairy_table_cell_repr(name)}</td>"
    row_rep += "</tr>"  
    return row_rep

row_height = "60"
def rasterfairy_table_cell_repr(khipu_name):
    if khipu_name and (not khipu_name.startswith("XX")):
        tile_left,tile_top = (0 + 50, 0 + 170)
        tile_string = f"<div style=\"border: 0px solid #DDD; height: {row_height}px;text-align:center;vertical-align:middle;\">"
        tile_string += f"<div class=\"khipu_title\">{khipu_name}</div>\n"
        tile_string += f"<a target=\"_blank\" href=\"https://www.khipufieldguide.com/sketchbook/khipu/{khipu_name}_sketch.html\">" 
        image_size = thumbnail_image_size(khipu_name)
        tile_string += f"<img style=\"border: 1px solid #000;\" alt=\"{khipu_name}\" align=\"center\" width=\"{round(image_size[0]/3)}px;\" src= \"https://www.khipufieldguide.com/sketchbook/images/thumbnails/{khipu_name}_wide.jpg\"/>\n"
        tile_string += "</a></div>\n"
    else:
        tile_string = f"<div style=\"border: 0px solid #FFE; height: {row_height}px;text-align:center;vertical-align:middle;\"></div>\n"
    return tile_string

#rasterfairy_table_repr()

def make_rasterfairy_quilt_file(title_string="The Khipu Image Similarity Browser",
                                output_filename = "rasterfairy_imagequilt"):
    
    sort_string = "Sorted by Image Similarity" 
    quilt_string = rasterfairy_table_repr()

    template_file = "khipu_image_quilt_template"
    html_template_filename = f"{ku.project_directory()}/code/templates/{template_file}.html"
    
    output_dir = f"{ku.project_directory()}/notebook/fieldmarks/appendix/notebooks"
    rasterfairy_quilt_html_filename = f"{output_dir}/{output_filename}.html"
    with open(html_template_filename, 'r') as reader:
        html_template = reader.read()

    html_template = html_template.replace("____TITLE_STRING____", title_string)
    html_template = html_template.replace("____SORTED_BY____", sort_string)
    html_template = html_template.replace("____QUILT_TABLE____", quilt_string)
    with open(rasterfairy_quilt_html_filename, 'w') as f: f.write(html_template)
    
    os.system(f"open -a 'Brave Browser' '{rasterfairy_quilt_html_filename}'")
    
make_rasterfairy_quilt_file()

The RasterFairy Quilt

The RasterFairy Quilt


The quilt deforms the projection so that every khipu occupies a unique spot on a grid, the eponymous raster-fairy layout. Interestingly, it appears that the first level sort is very simple - width vs height. From there on it uses additional pixel level features (edge detection, etc.) to sort the images by closeness.

Evaluation of How well the RasterFairy did.

The following pairs were identified in a previous appendix as duplicate khipus, by Manuel Medrano. We can compute the distance for the grid points for each pair to see how well the algorithm did at grouping duplicates.

  • UR035/AS70
  • UR043/AS30
  • UR083/AS208
  • UR126/UR115
  • UR133/UR036
  • UR163/AS056
  • UR176/LL01
  • UR236/AS181
  • UR281/AS068
  • HP041/AS046

As a baselines, let’s use the KFG Similarity index, created by a Hierarchical Clustering approach on textual documents describing the khipu.

Code
matching_pairs = [['UR035', 'AS070'], 
                  ['UR043', 'AS030'], 
                  ['UR083', 'AS208'], 
                  ['UR126', 'UR115'], 
                  ['UR133', 'UR036'], 
                  ['UR163', 'AS056'], 
                  ['UR176', 'LL01'], 
                  ['UR236', 'AS181'], 
                  ['UR281', 'AS068'], 
                  ['HP041', 'AS046']]


khipu_summary_df = kq.fetch_khipu_summary()
khipu_similarity = {x[0]:x[1] for x in zip(khipu_summary_df['name'], khipu_summary_df['hierarchical_cluster_pos'])}

print(ku.box_text_msg("HIERARCHICAL CLUSTERING PAIR DISTANCES"))
hc_distances = []
for pair in matching_pairs:
    (pos1,pos2) = (khipu_similarity[pair[0]], khipu_similarity[pair[1]])
    pair_distance = abs(pos1 - pos2)
    hc_distances.append(pair_distance)
    print(f"{pair}: {pair_distance}")
print(f"-------------------------\nMean Hierarchical Cluster distance is {statistics.mean(hc_distances)}")
+------------------------------------------------------------------------------+
|HIERARCHICAL CLUSTERING PAIR DISTANCES                                        |
+------------------------------------------------------------------------------+
['UR035', 'AS070']: 1
['UR043', 'AS030']: 68
['UR083', 'AS208']: 134
['UR126', 'UR115']: 1
['UR133', 'UR036']: 1
['UR163', 'AS056']: 297
['UR176', 'LL01']: 1
['UR236', 'AS181']: 310
['UR281', 'AS068']: 57
['HP041', 'AS046']: 122
-------------------------
Mean Hierarchical Cluster distance is 99.2

Now let’s use visual grid matching and see how that changes the mean distance:

Code
# Use this color for later...
plotly_color_dict = {name:"#ffffaa" for name in image_names}
matching_list = []
for pair in matching_pairs:
    cell_color = ku.cycled_color_12()
    plotly_color_dict[pair[0]] = cell_color
    plotly_color_dict[pair[1]] = cell_color
    matching_list.append(pair[0])
    matching_list.append(pair[1])
matching_list = set(matching_list)

grid_position_dict = {}
for row_num in range(len(named_rasterfairy_points_df)):
    record = named_rasterfairy_points_df.iloc[row_num]
    grid_position_dict[record['name']] = (record['x'], record['y'])

print(ku.box_text_msg("RASTERFAIRY GRID PAIR DISTANCES"))
grid_distances = []
for pair in matching_pairs:
    (pos1,pos2) = (grid_position_dict[pair[0]], grid_position_dict[pair[1]])
    dx, dy = (pos2[0]-pos1[0], pos2[1]-pos1[1])
    pair_distance = round(math.sqrt(dx*dx + dy*dy))
    grid_distances.append(pair_distance)
    print(f"{pair}: {pair_distance}")
print(f"Mean grid distance is {statistics.mean(grid_distances)}")


named_grid_points = [(name, plotly_color_dict[name], 5 if name in matching_list else 1, position[0], position[1]) for (name, position) in grid_position_dict.items()]
named_grid_points_df = pd.DataFrame(named_grid_points, columns=['name', 'color', 'match_size', 'x', 'y'])
point_colors = [plotly_color_dict[name] for name in image_names]
fig = (px.scatter(named_grid_points_df, x="x", y="y", color="color", size="match_size",
                  hover_name='name', hover_data=['x', 'y'], 
                  title=f"<b>Image Similarity in Projected 2D Space (using RasterFairy Grid) - Big Circles are Matching Pairs</b>",
                  width=1000, height=1000)
        .update_layout(showlegend=False, paper_bgcolor='#fffff8')
        .update(layout_coloraxis_showscale=False)
        .show()
      )
+------------------------------------------------------------------------------+
|RASTERFAIRY GRID PAIR DISTANCES                                               |
+------------------------------------------------------------------------------+
['UR035', 'AS070']: 8
['UR043', 'AS030']: 11
['UR083', 'AS208']: 1
['UR126', 'UR115']: 1
['UR133', 'UR036']: 2
['UR163', 'AS056']: 1
['UR176', 'LL01']: 4
['UR236', 'AS181']: 5
['UR281', 'AS068']: 1
['HP041', 'AS046']: 14
Mean grid distance is 4.8

A major improvement! A good algorithm but not a great one…

Let’s use the original tSNE/UMAP data instead of the grid and see if that improves it..

Code
# Use this color for later...
plotly_color_dict = {name:"#ffffaa" for name in image_names}
matching_list = []
for pair in matching_pairs:
    cell_color = ku.cycled_color_12()
    plotly_color_dict[pair[0]] = cell_color
    plotly_color_dict[pair[1]] = cell_color
    matching_list.append(pair[0])
    matching_list.append(pair[1])
matching_list = set(matching_list)

umap_file = f"{pixplot_directory}/data/layouts/umap_jittered_layout.json"
with open(umap_file) as f:
  umap_data_dict = json.load(f)

grid_x_space = round(1./(1-.9166666))
grid_y_space = round(1./(1-.9166666))
umap_points = [(round(item[0]*grid_x_space)+12, round(item[1]*grid_y_space)+12) for item in umap_data_dict]

named_umap_points = [(name, plotly_color_dict[name], 5 if name in matching_list else 1, position[0], position[1]) for (name, position) in  zip(image_names, umap_points)]
named_umap_points_df = pd.DataFrame(named_umap_points, columns=['name', 'color', 'match_size', 'x', 'y'])

tsne_position_dict = {}
for row_num in range(len(named_umap_points_df)):
    record = named_umap_points_df.iloc[row_num]
    tsne_position_dict[record['name']] = (record['x'], record['y'])

print(ku.box_text_msg("t-SNE PAIR DISTANCES"))
tsne_distances = []
for pair in matching_pairs:
    (pos1,pos2) = (tsne_position_dict[pair[0]], tsne_position_dict[pair[1]])
    dx, dy = (pos2[0]-pos1[0], pos2[1]-pos1[1])
    pair_distance = round(math.sqrt(dx*dx + dy*dy))
    tsne_distances.append(pair_distance)
    print(f"{pair}: {pair_distance}")
print(f"-------------------------\nMean tsne_distance is {statistics.mean(tsne_distances)}\n\n")
print("")

point_colors = [plotly_color_dict[name] for name in image_names]
fig = (px.scatter(named_umap_points_df, x="x", y="y", color="color", size="match_size",
                  hover_name='name', hover_data=['x', 'y'], 
                  title=f"<b>Image Similarity in Projected 2D Space (using UMAP/tSNE) - Big Circles are Matching Pairs</b>",
                  width=1000, height=1000)
        .update_layout(showlegend=False, paper_bgcolor='#fffff8')
        .update(layout_coloraxis_showscale=False)
        .show()
      )
+------------------------------------------------------------------------------+
|t-SNE PAIR DISTANCES                                                          |
+------------------------------------------------------------------------------+
['UR035', 'AS070']: 5
['UR043', 'AS030']: 8
['UR083', 'AS208']: 0
['UR126', 'UR115']: 0
['UR133', 'UR036']: 0
['UR163', 'AS056']: 0
['UR176', 'LL01']: 2
['UR236', 'AS181']: 2
['UR281', 'AS068']: 4
['HP041', 'AS046']: 10
-------------------------
Mean tsne_distance is 3.1


Unsurprisingly, it’s better. It’s still bad where the above approach is bad, but overall the 2D projection is better in accuracy than the deformed grid. It may even be better in the original unprojected space, but still bad where the above approach is bad!

In the above graph, not all khipus are visible, since some khipus lie on top of each other after the 2D tSNE projection from n-dimensional space. For example, UR083 and AS208 (the lime-green circle) lie on top of each other.

Reviewing a RasterFairy Grid of Duplicates

Let’s redraw the quilt showing the matches to see how well they match visually, and what influences their placement in the rasterfairy quilt.

Code
def rasterfairy_table_cell_repr(khipu_name):
    if khipu_name and (not khipu_name.startswith("XX")):
        tile_color = plotly_color_dict[khipu_name]
        is_matching_khipu = khipu_name in matching_list
        tile_left,tile_top = (0 + 50, 0 + 170)
        border_style = f"border:3px solid {tile_color};" if is_matching_khipu else "border:0px solid #DDD;"
        tile_string = f"<div style=\"{border_style} height:{row_height}px;text-align:center;vertical-align:middle;\">"
        tile_string += f"<div class=\"khipu_title\">{khipu_name}</div>\n"
        tile_string += f"<a target=\"_blank\" href=\"https://www.khipufieldguide.com/sketchbook/khipu/{khipu_name}_sketch.html\">" 
        image_size = thumbnail_image_size(khipu_name)
        tile_string += f"<img style=\"border: 1px solid #000;\" alt=\"{khipu_name}\" align=\"center\" width=\"{round(image_size[0]/3)}px;\" src= \"https://www.khipufieldguide.com/sketchbook/images/thumbnails/{khipu_name}_wide.jpg\"/>\n"
        tile_string += "</a></div>\n"
    else:
        tile_string = f"<div style=\"border: 0px solid #FFE; height: {row_height}px;text-align:center;vertical-align:middle;\"></div>\n"
    return tile_string

make_rasterfairy_quilt_file(title_string="The Duplicate Khipu Image Browser",
                            output_filename = "rasterfairy_duplicate_imagequilt")

The Duplicate RasterFairy Quilt